Abstract. We derive the Jeffreys prior for the parameter of the Multivariate Ewens Distribution and study some of its
properties. In particular, we show that this prior is proper and has no finite moments. We also investigate the impact of this 
default prior on the a priori distribution of the number of species and the a priori probability of discovery of a new species, 
which are usually employed in subjective prior elicitation. The effect of the Jeffreys prior for posterior inference is illustrated 
using examples arising in the context of inference for species sampling models and Dirichlet process mixture models. 



1. Introduction 



The Multivariate Ewens Distribution (MED), also known as the Ewens Sampling Formula (ESF) (Ewens, 1972;
Johnson et al., 1997), is a probability distribution on the partitions of the set $\{1, 2, \ldots, n\}$. It appears often in genetics
as the distribution of the number of distinct alleles in a sample of size $n$ drawn from an infinite idealized population, or
as the limiting distribution of other, more general models (Kingman, 1978; Hoppe, 1987). More generally, the MED
belongs to the class of species sampling models, which describe distributions for exchangeable random partitions
(Aldous, 1985; Pitman, 1995; Lijoi et al., 2007). The MED also appears in the context of Bayesian nonparametric
statistics, where it is related to the number of unique values in a random sample of size $n$ taken from a random
distribution that follows a Dirichlet process prior (Antoniak, 1974). Hence, in the context of Dirichlet process mixture
models, the MED acts as the prior distribution on the number and size of the clusters imposed by the model.

This paper is concerned with Bayesian estimation and prediction under the MED model. Gamma priors, or mixtures 
thereof, are commonly used as priors in this context because of the existence of simple Gibbs sampling algorithms 
based on data augmentation (Escobar & West, 1995). Hyperparameters are elicited by either exploiting their link with
the expected number of distinct alleles (Escobar & West, 1995) or, in the case of nonparametric mixture models, their
link with the mean and variance of the observations (Walker & Mallick, 1997). Alternatively, Carota & Parmigiani
(2002) and Griffin & Steel (2004) propose eliciting priors on the probabilities of new alleles, which in turn imply
a prior on the parameter of interest. In either case, elicitation can be difficult because of the lack of relevant prior 
information in specific applications. To deal with the lack of prior information, numerous authors have used very 
dispersed Gamma priors. In this paper we derive the Jeffreys prior associated with the MED, show that this prior is 
proper, and investigate some of its properties. Because of its invariance to transformations, the Jeffreys prior provides 
a natural "default" or "non-informative" alternative to the priors discussed above without a substantial increase in 
computational complexity. 

The remainder of the paper is organized as follows: Section 2 briefly reviews the Multivariate Ewens Distribution
and derives the Jeffreys prior associated with its parameter. Section 3 discusses some of the properties of the Jeffreys
prior. Section 4 presents some illustrations in the context of species sampling models and Dirichlet process mixture
models. We conclude in Section 5 with some brief remarks and future directions.

2. The Jeffreys Prior for the Multivariate Ewens Distribution 

Consider a partition of the set $\{1, 2, \ldots, n\}$ into $K \le n$ subsets so that there are $r_j$ subsets of size $j$, where $\sum_{j=1}^{n} r_j = K$
and $\sum_{j=1}^{n} j r_j = n$. For example, the $n$ elements of the original set might represent individuals being sampled from an
infinite population, while the subsets into which they are divided could be interpreted as the species to which these
individuals belong. The Multivariate Ewens Distribution (MED) assigns such a partition a probability

$$p(r_1, \ldots, r_n \mid \beta) = \frac{n!}{\beta(\beta+1)\cdots(\beta+n-1)} \prod_{j=1}^{n} \frac{\beta^{r_j}}{j^{r_j}\, r_j!}, \tag{1}$$

where $0 < \beta < \infty$ is a parameter controlling the shape of the distribution.

The partitions associated with the MED can alternatively be described in terms of a sequence of exchangeable
indicators $\xi_1, \ldots, \xi_n$ such that $\xi_i = k$ if the $i$-th individual in the population belongs to species $k$. Assuming that the species
are labeled consecutively between 1 and $K$, and letting $m_k = \sum_{i=1}^{n} 1(\xi_i = k)$ be the number of individuals in species
$k$, then

$$p(\xi_1, \ldots, \xi_n \mid \beta) = p(K, m_1, \ldots, m_K \mid \beta) = \frac{\Gamma(\beta)}{\Gamma(\beta+n)}\, \beta^{K} \prod_{k=1}^{K} \Gamma(m_k). \tag{2}$$

From (2) we can compute the probability mass function associated with the species of a new individual,

$$p(\xi_{n+1} = k \mid \xi_n, \ldots, \xi_1, \beta) = \begin{cases} \dfrac{p(K, m_1, \ldots, m_k + 1, \ldots, m_K \mid \beta)}{p(K, m_1, \ldots, m_k, \ldots, m_K \mid \beta)} = \dfrac{m_k}{\beta + n}, & k \le K, \\[2ex] \dfrac{p(K + 1, m_1, \ldots, m_k, \ldots, m_K, 1 \mid \beta)}{p(K, m_1, \ldots, m_k, \ldots, m_K \mid \beta)} = \dfrac{\beta}{\beta + n}, & k = K + 1. \end{cases}$$

This sequence of predictive distributions is sometimes called the Chinese restaurant process. 
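
As an illustration of these predictive distributions, the following minimal Python sketch draws the indicators one individual at a time: an existing species $k$ is chosen with probability $m_k/(\beta + i)$ and a new species with probability $\beta/(\beta + i)$, where $i$ is the number of individuals already assigned. The function name `sample_crp` and its interface are hypothetical and are not part of the paper.

```python
import random


def sample_crp(n, beta, rng):
    """Draw species labels xi_1..xi_n sequentially from the predictive rule.

    Hypothetical helper: with i individuals already assigned, an existing
    species k is chosen w.p. m_k / (beta + i), a new one w.p. beta / (beta + i).
    """
    counts = []   # counts[k] holds the size of species k + 1
    labels = []
    for i in range(n):
        u = rng.random() * (beta + i)   # uniform on [0, beta + i)
        acc = 0.0
        chosen = len(counts)            # default: open a new species
        for k, m in enumerate(counts):
            acc += m
            if u < acc:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(1)
        else:
            counts[chosen] += 1
        labels.append(chosen + 1)       # species are labeled 1..K
    return labels, counts


labels, counts = sample_crp(100, 1.0, random.Random(0))
K = len(counts)   # number of distinct species in the simulated sample
```

Because species are labeled in order of appearance, the labels produced this way run consecutively from 1 to $K$, matching the labeling convention used for (2).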

We are interested in estimating the parameter $\beta$ based on either an observed sample $\xi_1, \ldots, \xi_n$ or the sufficient
statistic $K$. The following lemma provides an expression for a natural default prior for $\beta$.

Lemma 1. For $n \ge 2$, the Jeffreys prior associated with (1) and (2) is given by

$$\pi_n^J(\beta) \propto \sqrt{\frac{1}{\beta} \sum_{j=1}^{n-1} \frac{j}{(\beta + j)^2}}. \tag{3}$$
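
Proof. By definition, $\pi_n^J(\beta) \propto |I(\beta)|^{1/2}$, where $I(\beta) = -E\left\{\frac{\partial^2}{\partial\beta^2} \log p(K, m_1, \ldots, m_K \mid \beta)\right\}$ is the Fisher information
associated with $\beta$. Now,

$$-E\left\{\frac{\partial^2}{\partial\beta^2} \log p(K, m_1, \ldots, m_K \mid \beta)\right\} = -\psi'(\beta) + \psi'(\beta + n) + \frac{E\{K\}}{\beta^2},$$

where $\psi'$ denotes the trigamma function (Abramowitz & Stegun, 1965). Now, using the facts that $E\{K\} = \sum_{j=0}^{n-1} \frac{\beta}{\beta + j}$
(e.g., see Antoniak, 1974) and $\psi'(\beta + n) = \psi'(\beta) - \sum_{j=0}^{n-1} \frac{1}{(\beta + j)^2}$ (e.g., see Abramowitz & Stegun, 1965), we get

$$I(\beta) = \sum_{j=0}^{n-1} \left\{ \frac{1}{\beta(\beta + j)} - \frac{1}{(\beta + j)^2} \right\} = \frac{1}{\beta} \sum_{j=1}^{n-1} \frac{j}{(\beta + j)^2},$$

which directly leads to

$$\pi_n^J(\beta) \propto \sqrt{\frac{1}{\beta} \sum_{j=1}^{n-1} \frac{j}{(\beta + j)^2}}.$$
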

In particular, note that, for $n = 2$ (the smallest sample size containing information about $\beta$), the Jeffreys prior on $\beta$
corresponds to a standard Cauchy prior on $\nu = \beta^{1/2}$.
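
The algebra in the proof can be checked numerically. The sketch below assumes the closed form $I(\beta) = \beta^{-1} \sum_{j=1}^{n-1} j/(\beta+j)^2$, which is what the two identities quoted in the proof give when combined, and compares it with the trigamma expression. The helper names are hypothetical, and `trigamma` is a simple recurrence-plus-asymptotic-series approximation written for this illustration rather than a library routine.

```python
import math


def trigamma(x):
    """Approximate psi'(x) for x > 0: shift x upward, then asymptotic series."""
    acc = 0.0
    while x < 6.0:                      # psi'(x) = psi'(x + 1) + 1 / x^2
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    series = inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return acc + series


def fisher_trigamma(beta, n):
    """-psi'(beta) + psi'(beta + n) + E{K} / beta^2."""
    expected_k = sum(beta / (beta + j) for j in range(n))
    return -trigamma(beta) + trigamma(beta + n) + expected_k / beta ** 2


def fisher_closed(beta, n):
    """(1 / beta) * sum_{j=1}^{n-1} j / (beta + j)^2."""
    return sum(j / (beta + j) ** 2 for j in range(1, n)) / beta


def jeffreys_unnorm(beta, n):
    """Unnormalized Jeffreys prior: square root of the Fisher information."""
    return math.sqrt(fisher_closed(beta, n))


# Largest discrepancy between the two expressions over a small grid.
gap = max(abs(fisher_trigamma(b, n) - fisher_closed(b, n))
          for b in (0.3, 1.0, 5.0) for n in (2, 10, 50))

# For n = 2, the induced density on nu = sqrt(beta) is proportional to 1 / (1 + nu^2).
nu = 0.7
cauchy_gap = abs(jeffreys_unnorm(nu ** 2, 2) * 2.0 * nu - 2.0 / (1.0 + nu ** 2))
```

Both `gap` and `cauchy_gap` should be zero up to floating-point and series-truncation error.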

3. Properties of the Jeffreys Prior for the MED 
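Surprisingly, the Jeffreys prior is proper. Indeed, since $\sqrt{\sum_j a_j} \le \sum_j \sqrt{a_j}$ for nonnegative $a_j$, note that for all $n \ge 2$

$$\int_0^\infty \sqrt{\frac{1}{\beta} \sum_{j=1}^{n-1} \frac{j}{(\beta + j)^2}}\, d\beta \le \sum_{j=1}^{n-1} \int_0^\infty \frac{\sqrt{j}}{\sqrt{\beta}\,(\beta + j)}\, d\beta = (n - 1)\pi < \infty.$$

Although the normalizing constant $C(n) = \int_0^\infty \sqrt{\frac{1}{\beta} \sum_{j=1}^{n-1} \frac{j}{(\beta + j)^2}}\, d\beta$ is not available in closed form, it is easily evaluated
numerically using quadrature methods. The prior is decreasing in $\beta$, which can be easily verified, since

$$\frac{\partial}{\partial\beta} \log \pi_n^J(\beta) = -\frac{1}{2\beta} - \frac{\sum_{j=1}^{n-1} j/(\beta + j)^3}{\sum_{j=1}^{n-1} j/(\beta + j)^2} < 0.$$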



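On the other hand, $\pi_n^J(\beta)$ is log-convex. Indeed, writing $S_r = \sum_{j=1}^{n-1} j/(\beta + j)^r$,

$$\frac{\partial^2}{\partial\beta^2} \log \pi_n^J(\beta) = \frac{1}{2\beta^2} + \frac{3 S_2 S_4 - 2 S_3^2}{S_2^2} > 0.$$

To show this, note that the first term is clearly positive over the support of the prior. For the second term, the
Cauchy-Schwarz inequality gives $S_3^2 \le S_2 S_4$, so that $3 S_2 S_4 - 2 S_3^2 \ge S_2 S_4 > 0$.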
Figure 1 presents graphs of $\pi_n^J$ for different values of $n$.

Figure 1. Density of the Jeffreys prior associated with the MED for some selected values of n.
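
The normalizing constant $C(n)$ can be approximated with a few lines of code. The sketch below is one possible quadrature scheme, not necessarily the one used to produce the figures: it assumes the form $\pi_n^J(\beta) \propto \{\beta^{-1} \sum_{j=1}^{n-1} j/(\beta+j)^2\}^{1/2}$, substitutes $\beta = u^2$ to remove the singularity at zero, maps $u = t/(1-t)$ onto the unit interval, and applies a midpoint rule. All function names are hypothetical.

```python
import math


def jeffreys_unnorm(beta, n):
    """Unnormalized Jeffreys prior sqrt((1/beta) * sum_{j=1}^{n-1} j/(beta+j)^2)."""
    s = sum(j / (beta + j) ** 2 for j in range(1, n))
    return math.sqrt(s / beta)


def integrate_0_inf(f, num=4000):
    """Midpoint rule for int_0^inf f(beta) d(beta) with beta = u^2, u = t/(1-t)."""
    h = 1.0 / num
    total = 0.0
    for i in range(num):
        t = (i + 0.5) * h
        u = t / (1.0 - t)
        jac = 2.0 * u / (1.0 - t) ** 2      # d(beta)/dt
        total += f(u * u) * jac
    return total * h


def normalizing_constant(n, num=4000):
    return integrate_0_inf(lambda b: jeffreys_unnorm(b, n), num)


c2 = normalizing_constant(2)    # exact value is pi (Cauchy case, n = 2)
c10 = normalizing_constant(10)  # bounded above by 9 * pi, by subadditivity of sqrt


def jeffreys_density(beta, n, c=None):
    """Normalized density; pass a precomputed constant c to avoid recomputation."""
    c = normalizing_constant(n) if c is None else c
    return jeffreys_unnorm(beta, n) / c
```

For $n = 2$ the exact value $C(2) = \pi$ follows from the Cauchy correspondence noted at the end of Section 2, which gives a convenient check on the quadrature.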



The moments of $\pi_n^J(\beta)$ do not exist. Indeed, because $\pi_n^J(\beta)$ decays like $\beta^{-3/2}$ as $\beta \to \infty$,

$$E\{\beta^s \mid n\} = \int_0^\infty \beta^s\, \pi_n^J(\beta)\, d\beta = \infty$$

for any $s \ge 1$. Instead, consider the median of $\pi_n^J(\beta)$ as a function of $n$. Again, although no closed-form expression
is available for this median, it can be computed numerically. Figure 2 suggests that the median grows more or less
linearly with the sample size $n$, with $\mathrm{Med}(\beta) \approx 0.36n + 1$ for $n < 400$.

Figure 2. Median of $\pi_n^J(\beta)$ as a function of the sample size $n$.
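
One way to compute the median numerically is to accumulate the prior mass on a transformed grid until half of the total is reached. The sketch below assumes the form $\pi_n^J(\beta) \propto \{\beta^{-1} \sum_{j=1}^{n-1} j/(\beta+j)^2\}^{1/2}$ and uses the substitution $\beta = u^2$, $u = t/(1-t)$ with a midpoint rule; the function names are hypothetical, and the grid size is an arbitrary choice.

```python
import math


def jeffreys_unnorm(beta, n):
    """Unnormalized Jeffreys prior sqrt((1/beta) * sum_{j=1}^{n-1} j/(beta+j)^2)."""
    s = sum(j / (beta + j) ** 2 for j in range(1, n))
    return math.sqrt(s / beta)


def jeffreys_median(n, num=4000):
    """Median of the prior, located on the t-grid where beta = (t / (1 - t))^2."""
    h = 1.0 / num
    weights = []
    for i in range(num):
        t = (i + 0.5) * h
        u = t / (1.0 - t)
        weights.append(jeffreys_unnorm(u * u, n) * 2.0 * u / (1.0 - t) ** 2)
    half = 0.5 * sum(weights)
    acc = 0.0
    for i, w in enumerate(weights):
        if acc + w >= half:
            t = (i + (half - acc) / w) * h   # linear interpolation inside cell i
            u = t / (1.0 - t)
            return u * u
        acc += w
    raise RuntimeError("median not bracketed")


medians = {n: jeffreys_median(n) for n in (2, 5, 20)}
```

For $n = 2$ the median of $\nu = \beta^{1/2}$ under the Cauchy form is 1, so the median of $\beta$ is exactly 1, which can serve as a sanity check; values for larger $n$ can be compared against the approximately linear trend described above.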



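Alternatively, we can consider the impact of the Jeffreys prior on the prior distribution of the number of species by
computing prior moments such as the prior mean,

$$E\{K \mid n\} = E\{E(K \mid \beta, n)\} = \int_0^\infty \left\{ \sum_{j=0}^{n-1} \frac{\beta}{\beta + j} \right\} \pi_n^J(\beta)\, d\beta,$$

and the prior variance, which by the law of total variance is

$$\mathrm{Var}\{K \mid n\} = E\{\mathrm{Var}(K \mid \beta, n)\} + \mathrm{Var}\{E(K \mid \beta, n)\}, \qquad \mathrm{Var}(K \mid \beta, n) = \sum_{j=1}^{n-1} \frac{\beta j}{(\beta + j)^2}.$$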

The left panel in Figure 3 presents graphs for the prior mean and prior standard deviation as functions of $n$. Just like
the median of $\pi_n^J(\beta)$, both of these quantities grow almost linearly with $n$. Moreover, note that $E\{K\} \approx n/2$, and that
the prior standard deviation grows somewhat more slowly than $E\{K\}$. More generally, we can compute the marginal
prior probability distribution on the number of species,

$$\Pr(K = k \mid n) = \int_0^\infty \Pr(K = k \mid \beta, n)\, \pi_n^J(\beta)\, d\beta = |S(n, k)|\, A(n, n, k),$$

where

$$A(n, m, k) = \int_0^\infty \beta^{k}\, \frac{\Gamma(\beta)}{\Gamma(\beta + n)}\, \pi_m^J(\beta)\, d\beta \tag{4}$$

and $|S(n, k)|$ is the absolute value of the Stirling number of the first kind. For illustrative purposes, the right panel of Figure 3 shows
the marginal distribution $\Pr(K = k \mid n = 100)$. The effect of the Jeffreys prior is striking; the resulting distribution
is U-shaped and roughly symmetric around $n/2$ (which is compatible with our observations about $E(K \mid n)$). Hence,
under this prior the model prefers a priori either a very small or a very large number of species, with values around the
mean/median actually having very low prior probabilities.
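
The marginal prior on the number of species can be approximated numerically. The sketch below relies on two standard facts, $\Pr(K = k \mid \beta, n) = c(n, k)\, \beta^k\, \Gamma(\beta)/\Gamma(\beta + n)$ with $c(n, k)$ the unsigned Stirling numbers of the first kind, and $\Gamma(\beta)/\Gamma(\beta+n) = 1/\prod_{j=0}^{n-1}(\beta + j)$, and it assumes the form $\pi_n^J(\beta) \propto \{\beta^{-1} \sum_{j=1}^{n-1} j/(\beta+j)^2\}^{1/2}$. The function names, the midpoint quadrature, and the grid size are illustrative choices, not the paper's implementation.

```python
import math


def jeffreys_unnorm(beta, n):
    """Unnormalized Jeffreys prior sqrt((1/beta) * sum_{j=1}^{n-1} j/(beta+j)^2)."""
    s = sum(j / (beta + j) ** 2 for j in range(1, n))
    return math.sqrt(s / beta)


def unsigned_stirling_first(n):
    """Row n of the unsigned Stirling numbers of the first kind, c(n, 0..n)."""
    row = [1]                                    # c(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            below = row[k] if k < m else 0
            new[k] = row[k - 1] + (m - 1) * below  # c(m,k) = c(m-1,k-1) + (m-1) c(m-1,k)
        row = new
    return row


def marginal_k(n, num=2000):
    """Pr(K = k | n) for k = 0..n, integrating beta out by a midpoint rule."""
    log_c = [math.log(c) if c > 0 else None for c in unsigned_stirling_first(n)]
    probs = [0.0] * (n + 1)
    total = 0.0
    h = 1.0 / num
    for i in range(num):
        t = (i + 0.5) * h
        u = t / (1.0 - t)
        beta = u * u
        w = jeffreys_unnorm(beta, n) * 2.0 * u / (1.0 - t) ** 2 * h
        total += w
        log_beta = math.log(beta)
        log_den = sum(math.log(beta + j) for j in range(n))
        for k in range(1, n + 1):
            probs[k] += w * math.exp(log_c[k] + k * log_beta - log_den)
    return [p / total for p in probs]


probs = marginal_k(20)
prior_mean_k = sum(k * p for k, p in enumerate(probs))   # compare with n / 2
```

Working on the log scale keeps the large Stirling numbers from overflowing for moderate $n$; plotting `probs` against $k$ gives a direct view of the shape of the marginal distribution.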

Yet another direction is to consider the effect of $\pi_n^J(\beta)$ on the a priori probability of discovery of a new species,
$\eta = \beta/(\beta + n + 1)$. In particular, Figure 4 presents graphs of the expected value and the variance of $\eta$,

$$E[\eta \mid n] = \int_0^\infty \frac{\beta}{\beta + n + 1}\, \pi_n^J(\beta)\, d\beta, \qquad \mathrm{Var}[\eta \mid n] = \int_0^\infty \left( \frac{\beta}{\beta + n + 1} \right)^2 \pi_n^J(\beta)\, d\beta - \left\{ \int_0^\infty \frac{\beta}{\beta + n + 1}\, \pi_n^J(\beta)\, d\beta \right\}^2,$$

as functions of $n$. Note that the values of both of these summaries are quite stable, with the expected probability of
discovery of a new species varying between 0.38 and 0.40 over the range considered here.
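
These two summaries are one-dimensional integrals and can be approximated in the same way as the normalizing constant. The sketch below assumes the form $\pi_n^J(\beta) \propto \{\beta^{-1} \sum_{j=1}^{n-1} j/(\beta+j)^2\}^{1/2}$ and uses a midpoint rule after the substitution $\beta = u^2$, $u = t/(1-t)$; the function names are hypothetical.

```python
import math


def jeffreys_unnorm(beta, n):
    """Unnormalized Jeffreys prior sqrt((1/beta) * sum_{j=1}^{n-1} j/(beta+j)^2)."""
    s = sum(j / (beta + j) ** 2 for j in range(1, n))
    return math.sqrt(s / beta)


def eta_moments(n, num=4000):
    """Prior mean and variance of eta = beta / (beta + n + 1)."""
    h = 1.0 / num
    z = m1 = m2 = 0.0
    for i in range(num):
        t = (i + 0.5) * h
        u = t / (1.0 - t)
        beta = u * u
        w = jeffreys_unnorm(beta, n) * 2.0 * u / (1.0 - t) ** 2
        eta = beta / (beta + n + 1.0)
        z += w               # accumulates the normalizing constant
        m1 += w * eta
        m2 += w * eta * eta
    mean = m1 / z
    return mean, m2 / z - mean ** 2


mean2, var2 = eta_moments(2)   # exact mean for n = 2 is (sqrt(3) - 1) / 2
```

For $n = 2$ the expectation has the closed form $(\sqrt{3} - 1)/2$ by partial fractions under the Cauchy form of the prior, which gives a simple check of the quadrature before it is applied to larger $n$.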

Figure 3. Left panel: expectation and standard deviation of the number of unique alleles, $E\{K \mid n\}$
and $\sqrt{\mathrm{Var}\{K \mid n\}}$, as a function of the sample size $n$. Note that they both appear to grow linearly with
$n$. Right panel: full marginal distribution $p(K = k \mid n) = \int_0^\infty p(K = k \mid \beta)\, \pi_n^J(\beta)\, d\beta$ for $n = 100$.

Figure 4. Expectation (left panel) and variance (right panel) of the probability of discovery of a new
species, $\beta/(\beta + n + 1)$, as a function of the sample size $n$.

4. Illustrations 

4.1. Species sampling models. In this subsection we consider the use of our Jeffreys prior for inference under the
MED model. We consider first a series of simulations where data are generated under a MED with true parameter
$\beta_t = 1$ or $\beta_t = 2$. We consider two different scenarios, corresponding to $n = 100$ and $n = 1{,}000$ individuals, and
for each of these scenarios we study the frequentist properties of Bayesian interval estimators generated under two
different default priors: the Jeffreys prior derived in this paper, and a diffuse Gamma with shape parameter 0.001 and
rate 0.001 (which has mean 1). The choice of a diffuse Gamma prior centered around $\beta = 1$ reflects standard practice
in the literature.

Table 1. Results of a simulation study to explore the frequentist coverage and expected length of
$100\alpha\%$ credible intervals for $\beta$ generated under three different priors.

Table 2. Posterior credible inferences for $\beta$ and $\eta = \beta/(\beta + n + 1)$ (the probability of discovery of a
new species) under two different "non-informative" prior distributions for the sequencing tag data.



Markov chain Monte Carlo algorithms were used to obtain samples from the posterior distribution under each prior.
All inferences are based on 100,000 iterations of the chain obtained after a burn-in period of 5,000 iterations. For
the Jeffreys prior, we employed a random walk Metropolis-Hastings algorithm with Gaussian proposals for $\log \beta$; the
variance of the proposal was $\tau^2 = 0.05$ for $n = 1{,}000$ and $\tau^2 = 1$ for $n = 100$, which resulted in average acceptance
rates of 70% and 60%, respectively. For the Gamma prior, we employed the latent variable approach discussed in
Escobar & West (1995), which does not require any tuning parameter.
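
A minimal sketch of such a random walk sampler on $\phi = \log \beta$ is given below. It assumes that the data enter only through $K$ and $n$, so that the likelihood is proportional to $\beta^K \Gamma(\beta)/\Gamma(\beta+n)$, that the prior has the form $\pi_n^J(\beta) \propto \{\beta^{-1} \sum_{j=1}^{n-1} j/(\beta+j)^2\}^{1/2}$, and it adds the Jacobian term $\phi$ for the change of variables. The function names, starting value, and the small run shown here are illustrative assumptions and do not reproduce the settings of the study.

```python
import math
import random


def log_post_phi(phi, k_obs, n):
    """Unnormalized log posterior of phi = log(beta) given K = k_obs and n."""
    beta = math.exp(phi)
    s = sum(j / (beta + j) ** 2 for j in range(1, n))
    log_prior = 0.5 * (math.log(s) - phi)              # log of sqrt(s / beta)
    log_lik = k_obs * phi + math.lgamma(beta) - math.lgamma(beta + n)
    return log_lik + log_prior + phi                   # + phi: Jacobian of beta = exp(phi)


def rw_metropolis(k_obs, n, n_iter, burn_in, tau2, rng):
    """Random walk Metropolis on log(beta) with Gaussian proposals of variance tau2."""
    phi = 0.0                                          # arbitrary start at beta = 1
    cur = log_post_phi(phi, k_obs, n)
    sd = math.sqrt(tau2)
    draws = []
    accepted = 0
    for it in range(burn_in + n_iter):
        prop = phi + rng.gauss(0.0, sd)
        cand = log_post_phi(prop, k_obs, n)
        if rng.random() < math.exp(min(0.0, cand - cur)):
            phi, cur = prop, cand
            if it >= burn_in:
                accepted += 1
        if it >= burn_in:
            draws.append(math.exp(phi))                # store beta, not log(beta)
    return draws, accepted / n_iter


draws, acc_rate = rw_metropolis(k_obs=5, n=100, n_iter=2000, burn_in=500,
                                tau2=1.0, rng=random.Random(1))
post_median = sorted(draws)[len(draws) // 2]
```

Working on the log scale keeps the chain on the positive half-line without boundary corrections. The proposal variance `tau2` plays the role of $\tau^2$ above and would need tuning for each sample size to obtain reasonable acceptance rates.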



Table 1 presents empirical coverage probabilities and interval lengths for the symmetric credible intervals constructed under each prior.
These summaries were estimated on the basis of 2,000 randomly generated datasets. As expected, the frequentist
coverage probability of 90% and 95% symmetric credible intervals under the Jeffreys prior seems to coincide with their
nominal posterior probability. On the other hand, the dispersed Gamma prior produces intervals that are somewhat
tighter than those under the Jeffreys prior, but which tend to have lower empirical coverage rates (particularly for 90%
coverage).

As a second illustration, we consider a real dataset discussed in Mao & Lindsay (2002), Mao (2004), and Lijoi
et al. (2007). The data consist of $n = 2586$ randomly selected expressed sequence tags taken from a large cDNA
library made from the 0 mm to 3 mm buds of tomato flowers. The number of distinct tags observed in this sample is
$K = 1825$. Table 2 presents estimates of the MED parameter $\beta$ under the same two priors discussed in the simulation
study. The same MCMC algorithms described in the simulation study were used in this analysis. Note that, although
inferences for $\beta$ differ somewhat between the two priors, inferences for the probability of discovery of a new species,
$\eta = \beta/(\beta + n + 1)$, are almost identical in both cases.

4.2. Dirichlet process mixture models. The Dirichlet process (DP) (Ferguson, 1973; Antoniak, 1974; Sethuraman,
1994) defines a prior distribution on the space of discrete measures and has been widely used in the context of
nonparametric Bayesian inference. However, because of the discrete nature of these distributions, the DP is not typically used
to model the data directly, but as a prior for the mixing distribution in a kernel convolution. In that case, the data
generating process for an independent and identically distributed sample $y_1, \ldots, y_n$ is assumed to be

$$y_i \mid G \sim \int p(y_i \mid \theta)\, G(d\theta), \qquad G \mid \beta \sim DP(\beta, G_0),$$

where $DP(\beta, G_0)$ denotes a Dirichlet process prior with centering measure $G_0$ and precision parameter $\beta$, and $p(y_i \mid \theta)$
is a kernel indexed by the finite-dimensional parameter vector $\theta$. Such a model can alternatively be described in terms
of a series of partition indicators $\xi_1, \ldots, \xi_n$ and component-specific parameters $\theta_1, \theta_2, \ldots$, such that

$$y_i \mid \xi_i, \theta_1, \theta_2, \ldots \sim p(y_i \mid \theta_{\xi_i}), \qquad \xi_1, \ldots, \xi_n \mid \beta \sim \mathrm{MED}(\beta), \qquad \theta_k \sim G_0,$$

where $\mathrm{MED}(\beta)$ represents the multivariate Ewens distribution with parameter $\beta$.

We ran two simulation studies to investigate the impact of the (marginal) Jeffreys prior on posterior inferences 
for the DP mixture model. First, a dataset consisting of 50 observations was generated from a negative binomial 
distribution with mean 20 and variance 220, and a DP mixture of Poisson kernels with an unknown precision parameter 
$\beta$ and a Gamma baseline measure was fitted to this data (the baseline measure was selected so that it had mean 20 and
variance 200). Note that, because the negative binomial can be represented as a scale mixture of Poissons, the true
data generating process corresponds to the limit of the Poisson DP mixture prior when $K = n = 50$ (which can be
obtained by letting $\beta \to \infty$).
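
The moment matching behind this design can be made explicit. If each observation has its own Poisson rate drawn from a Gamma distribution with mean 20 and variance 200, the marginal distribution is negative binomial with mean 20 and variance $20 + 200 = 220$. The sketch below computes the implied Gamma shape and rate and simulates 50 observations this way; the function names are hypothetical, and the Poisson sampler is the classical multiplication method, written out because the Python standard library does not provide one.

```python
import math
import random


def gamma_shape_rate(mean, var):
    """Shape and rate of a Gamma distribution with the given mean and variance."""
    rate = mean / var
    return mean * rate, rate            # shape = mean^2 / var


def mixture_moments(mean, var):
    """Mean and variance of a Poisson whose rate is Gamma(mean, var) distributed."""
    return mean, mean + var             # Var(y) = E(lambda) + Var(lambda)


def sample_poisson(lam, rng):
    """Multiplication (Knuth-style) Poisson sampler; adequate for moderate rates."""
    limit = math.exp(-lam)
    k = 0
    p = 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_negbin(n_obs, mean, var, rng):
    """Each observation gets its own Gamma-distributed rate (the K = n limit)."""
    shape, rate = gamma_shape_rate(mean, var)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    return [sample_poisson(rng.gammavariate(shape, 1.0 / rate), rng)
            for _ in range(n_obs)]


shape, rate = gamma_shape_rate(20.0, 200.0)     # shape 2, rate 0.1
y = simulate_negbin(50, 20.0, 200.0, random.Random(0))
```

With these values the Gamma baseline has shape 2 and rate 0.1, and drawing one rate per observation corresponds to the case in which every observation occupies its own mixture component.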

We considered two different prior distributions for $\beta$, namely, the Jeffreys prior introduced in this paper and the
"non-informative" Gam(0.001, 0.001). A variant of the collapsed Gibbs samplers described in Neal (2000) was used
to fit the model. In the case of the Jeffreys prior, we used a version of the algorithm that integrates over $\beta$, so that the
posterior full conditional distribution for $\xi_i$ combines the likelihood contribution of $y_i$ under each component with the prior weights

$$p(\xi_i = k \mid \xi_{-i}) = \begin{cases} m_k^{-}\, \dfrac{A(n, n, K^{-})}{A(n-1, n, K^{-})}, & k \le K^{-}, \\[2ex] \dfrac{A(n, n, K^{-} + 1)}{A(n-1, n, K^{-})}, & k = K^{-} + 1, \end{cases}$$

where $A(n, m, k)$ was defined in (4) and the negative superscript denotes the appropriate quantities computed after
excluding observation $i$.

We focused our analysis on the posterior distribution of the number of occupied mixture components K. The 
posterior mean for K was very similar in both cases, 9.1309 under the Jeffreys prior and 9.0468 under the diffuse 
Gamma prior, with configurations including more than 16 mixture components having negligible posterior probability. 
Having a number of occupied clusters that is smaller than n is not really surprising; because the Dirichlet process prior 
strongly favors clustering, we expect the model to underestimate the number of mixture components. What is really 
interesting is that the Jeffreys prior seems to favor a larger number of clusters than the Gamma prior. Indeed, the 
posterior distribution of K under the Jeffreys prior seems to be stochastically greater than the posterior distribution 
under the Gamma prior (the same phenomenon appeared when we repeated the simulation study with other datasets).
This suggests that the Jeffreys prior does a slightly better job at identifying the true number of components in this case.

Finally, in order to evaluate whether the previous behavior is due to a systematic bias in the Jeffreys prior towards
larger values of $K$, we ran a similar experiment where data were generated instead from a Poisson distribution with
mean 20. Hence, in this case $K = 1$ corresponds to the truth. Here, the posterior distribution for $K$ under both
priors was identical (up to Monte Carlo error), so systematic bias does not seem to be present.



5. Concluding remarks 

To the best of our knowledge, this is the first derivation of the Jeffreys prior associated with the MED. Our numerical
evaluations suggest that it might represent a reasonable default prior in situations where little prior information is
available, including hierarchical models such as nonparametric mixture models based on the Dirichlet process. However,
the Jeffreys prior explicitly depends on the sample size observed. Hence, any statistical procedures derived under
this prior will depend on the stopping rule associated with the experiment; for example, the results will vary depending
on whether data is analyzed sequentially or in batches.